Lineage Graph

An RDD is lazy in nature: a series of transformations applied to an RDD is not evaluated immediately. When we create a new RDD from an existing Spark RDD, the new RDD carries a pointer to its parent RDD. In other words, all of the dependencies between RDDs are recorded in a graph, rather than in the actual data. This graph is what we call the lineage graph. RDD lineage is simply the graph of all the parent RDDs of an RDD; it is also called the RDD operator graph or RDD dependency graph. To be specific, it is the output of applying transformations to an RDD, and from it Spark builds a logical execution plan.

In the process, we may lose an RDD if a fault arises on a machine. By reapplying the same computations recorded in the lineage graph on that node, we can recover the same dataset again. Hence, this process provides fault tolerance, or self-recovery.
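The idea above can be sketched in plain Scala. This is a simplified model, not Spark's real classes: each derived dataset keeps a pointer to its parent and the function that produced it, so a lost result can be rebuilt by replaying the chain from the source.

```scala
// Toy lineage model (illustrative only; Spark's actual RDD classes differ).
sealed trait Lineage[A] {
  // Replays the whole chain of transformations from the source —
  // this is what "recovery via lineage" amounts to.
  def compute(): Seq[A]
}

// The source dataset, analogous to sc.parallelize(...)
case class Source[A](data: Seq[A]) extends Lineage[A] {
  def compute(): Seq[A] = data
}

// A map transformation: remembers its parent and the mapping function
case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}

// A filter transformation: remembers its parent and the predicate
case class Filtered[A](parent: Lineage[A], p: A => Boolean) extends Lineage[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    // Mirrors the rdd1 -> rdd2 -> rdd3 chain from the example below
    val rdd1 = Source(1 to 10)
    val rdd2 = Mapped(rdd1, (x: Int) => x + 2)
    val rdd3 = Filtered(rdd2, (x: Int) => x > 10)

    // Nothing ran while the chain was being built (laziness);
    // "recovering" the data is just calling compute() again.
    println(rdd3.compute()) // the elements 11 and 12
  }
}
```

Note that building `rdd3` does no work: only `compute()` walks the parent pointers, which is exactly how Spark recomputes a lost partition from its lineage.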

val rdd1 = sc.parallelize(1 to 10)    // source RDD from a local collection
val rdd2 = rdd1.map(x => x + 2)       // rdd2 keeps a pointer to rdd1
val rdd3 = rdd2.filter(x => x > 10)   // rdd3 keeps a pointer to rdd2; nothing has executed yet

Using the toDebugString method to get the RDD lineage graph in Spark
We read a lineage graph from bottom to top: the hierarchy of RDDs starts at the bottom with the source RDD and ends at the top with the most recently derived RDD.

rdd3.toDebugString
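An illustrative, abbreviated result of the call above — the exact RDD ids, partition count, and console line numbers depend on your shell session:

```
(2) MapPartitionsRDD[2] at filter at <console>:25
 |  MapPartitionsRDD[1] at map at <console>:24
 |  ParallelCollectionRDD[0] at parallelize at <console>:23
```

Reading bottom to top: the ParallelCollectionRDD created by parallelize is the source, and map and filter each produce a MapPartitionsRDD that points to its parent.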



Difference between lineage graph and DAG

Lineage graph:
- When a new RDD is derived from an existing RDD using a transformation, Spark keeps track of the dependency between them; this graph of dependencies is the lineage graph.
- In case of data loss, the lineage graph is used to rebuild the data.
- The lineage graph deals with RDDs, so it is applicable only up to transformations.

DAG:
- Once an action is called, the SparkContext hands the logical plan to the DAG scheduler, which translates it into a physical execution plan: a set of stages whose jobs are submitted as task sets for execution.
- The DAG scheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling.
- The DAG shows the complete job, i.e. transformations + the action.
